Reconfigurable general purpose processor having time restricted configurations

ABSTRACT

A processor includes a reconfigurable field of data processing cells. A register is provided that has a data stream memory designed to store a data stream and/or parts thereof. The register may be a RAM PAE.

FIELD OF THE INVENTION

The present invention relates to reconfigurable multidimensional logic fields and their operation.

BACKGROUND INFORMATION

Reconfigurable elements are designed differently depending on the application to be executed and are designed to be consistent with the application. A reconfigurable architecture is understood in the present case to refer to modules or units (VPUs) having a configurable function and/or interconnection, in particular integrated modules having a plurality of arithmetic and/or logic and/or analog and/or memory and/or internally/externally interconnected modules arranged in one or more dimensions and interconnected directly or via a bus system.

The generic type represented by these modules includes in particular systolic arrays, neural networks, multiprocessor systems, processors having multiple arithmetic units and/or logic cells and/or communicative/peripheral cells (IO), interconnection and network modules, e.g., crossbar switches as well as known modules of the FPGA, DPGA, Chameleon, VPUTER, etc. types. Reference is made in particular in this connection to the following patents and applications by the present applicant: DE 44 16 881 A1, DE 197 81 412 A1, DE 197 81 483 A1, DE 196 54 846 A1, DE 196 54 593 A1, DE 197 04 044.6 A1, DE 198 80 129 A1, DE 198 61 088 A1, DE 199 80 312 A1, PCT/DE 00/01869, DE 100 36 627 A1, DE 100 28 397 A1, DE 101 10 530 A1, DE 101 11 014 A1, PCT/EP 00/10516, EP 01 102 674 A1, DE 198 80 128 A1, DE 101 39 170 A1, DE 198 09 640 A1, DE 199 26 538.0 A1, DE 100 50 442 A1, as well as PCT/EP 02/02398, DE 102 40 000, DE 102 02 044, DE 102 02 175, DE 101 29 237, DE 101 42 904, DE 101 35 210, EP 01 129 923, PCT/EP 02/10084, DE 102 12 622, DE 102 36 271, DE 102 12 621, EP 02 009 868, DE 102 36 272, DE 102 41 812, DE 102 36 269, DE 102 43 322, EP 02 022 692, PACT40. Reference is made to the documents below by using the applicant's internal reference notation. These are herewith incorporated to the full extent for disclosure purposes.

The aforementioned architecture is used as an example for illustration and is referred to below as a VPU. This architecture is composed of any arithmetic or logic cells (including memories) and/or memory cells and/or interconnection cells and/or communicative/peripheral (IO) cells (PAEs) which may be arranged to form a one-dimensional or multidimensional matrix (PA), which may have different cells of any design. Bus systems are also understood to be cells here. The matrix as a whole or parts thereof are assigned a configuration unit (CT, load logic) which configures the interconnection and function of the PA. The CT may be designed as a dedicated unit according to PACT05, PACT10, PACT17, for example, or as a host microprocessor according to P 44 16 881.0-53, DE 101 06 856.9; it may be assigned to the PA and/or implemented with or through such a unit.

SUMMARY

The present invention relates to a processor model for reconfigurable architectures based on the model of a traditional processor in some essential points. For better understanding, the traditional model will be first considered in greater detail. Resources external to the processor (e.g., main memory for programs and data, etc.) are not considered here.

A processor executes a program in a process. The program includes a finite quantity of instructions (this quantity may include multiple instances of elements) as well as information regarding the order in which the instructions may follow one another. This order is determined primarily by the linear arrangement of the instructions in the program memory and the targets of jump instructions.

Instructions are usually identified by their address. As an example, FIG. 1 (a) shows a program written in VAX Assembler for exponentiation.

A program may also be interpreted as a directed graph, where the instructions form the nodes and their order is modeled as edges of the graph. This graph is shown in FIG. 1 (b). The graph has a definite start node and a definite end node (not shown in the figure; indicated by the arrows). The edges may additionally be marked with transition probabilities. This information may then be used for jump prediction. The jump prediction may in turn be used for preloading configurations into the memory of the CT of a VPU (see patent application PACT10, the full content of which has been included for disclosure purposes) and/or for preloading configurations into the configuration stack of the PAE (according to patent applications PACT13, PACT17, PACT31, the full content of which is included for disclosure purposes). By preloading configurations into the local memory of the CT (see PACT10, 17) and/or into the PAE's local configuration cache (PACT17, 31), the configurations may then be called more rapidly as needed, which yields a great increase in efficiency.

The linear arrangement of the instructions in the memory results in more dependences than absolutely necessary; e.g., in the example shown here, instructions DECL and MULL2 are mutually independent. This is not indicated by the graph in FIG. 1 (b). The model may be expanded accordingly by division nodes and combination nodes, as illustrated in FIG. 1 (c).

Processors today implement such possibilities of parallel execution in hardware to some extent and distribute the operations among various arithmetic logic units. The model from FIG. 1 (b) will be used for further consideration. The discussion of the additional complexity of division and combining will be shifted to a later point in time. A process also needs other resources in addition to the program for its execution. Within the processor, these include the registers and the status flags.

These resources are used to convey information between the individual program instructions. The task of the operating system is to ensure that the resources needed for execution of a process are available to it and are released again when the process is terminated. Processors today usually have only one set of registers, so that only one process may run on the processor at a time. It is possible for the instructions of two different processes to be executable in any order as long as both processes use disjoint resources (e.g., if process 1 is using registers 0-3 and process 2 is using registers 4-7).

Instructions of a processor usually have the following properties:

-   An instruction is not interrupted during execution.
-   The execution time for all instructions does not exceed a certain maximum value.
-   Invalid instructions are recognized by the processor.

An object of the present invention is to provide a novel approach for commercial use.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 a shows a program written in VAX assembler.

FIG. 1 b shows the program interpreted as a graph.

FIG. 1 c shows an expanded model.

FIG. 2 shows a subprogram in graphic representation.

FIG. 3 shows an inserted subprogram call.

FIG. 4 a shows the position of a pointer at the beginning of a CIW.

FIG. 4 b shows how the pointer position of a register may appear at the end of a CIW.

FIG. 5 a shows a register before a write access.

FIG. 5 b shows that existing data may be deleted in such a way that a write access begins with an empty vector.

FIG. 5 c shows that the write data may be appended to the existing content.

FIGS. 5 d and 5 e show the state of the register after successful write operations.

FIGS. 6 a-6 e show read/write accesses.

FIG. 7 shows an example of a FIFO stage.

FIG. 8 shows the connection of multiple stages.

FIG. 9 shows a possible cache content during operation.

FIG. 10 a shows the free list as completely full.

FIG. 10 b shows memory parts affected.

FIG. 11 a shows a state prior to deletion.

FIG. 11 b shows a state after deletion.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

2. Transfer of the Model to the VPU Architecture

An exemplary VPU architecture is a reconfigurable processor architecture as described in, for example, PACT01, 02, 03, 04, 05, 07, 08, 09, 10, 13, 17, 22, 23, 24, 31. As mentioned above, the full content of these documents is herewith incorporated for disclosure purposes. Reference is also made to PACT11, 20 and 27, which describe corresponding high-level language compilers, as well as to PACT21, which describes a corresponding debugger. The full content of these documents is also included here for disclosure purposes.

The traditional instruction is replaced by a configuration in the known sense. For example, the aforementioned DE 101 39 170, which is incorporated herein by reference, describes that, for such configurations, a partitioning may be calculated with the aim of processing as many data packets as possible without changing the configured commands. For example, for an algorithm and/or a series of statements, a breakdown of the algorithm and/or series of statements may be chosen so that as many data packets as possible can be processed without changing the configuration. The configuration is referred to in the following discussion as a complex instruction word (CIW). The edges of the graph in FIG. 1 (b) are formed by trigger signals to the CT. It is thus possible to implement a complete program by having the CT and/or the configuration cache of the PAEs load the following CIW after successful processing of one CIW (see PACT31 and/or as described below).

It was recognized first how a correspondence of registers of traditional processors could be implemented on the VPU architecture. It was discovered that an essential prerequisite for register implementation is based on the following:

-   Since the VPU operates essentially on data streams, a register must be capable of storing a data stream and/or parts thereof.
-   A register must be capable of being allocated and released. It must remain occupied as long as the program is running on the VPU (HW support of the resource management of the operating system).
-   Simultaneous reading and writing (read-modify-write) of the same register should be possible.

It is explained how this may be achieved in a processor and the use of suitably modified RAM PAEs is also proposed according to the present invention. These should first be used as registers.

A detailed description of the register PAEs, preferably implemented by expanded and/or modified RAM PAEs, is given in section 4 below. A configuration (CIW) is removed from the array at the moment when it requests the next CIW from the CT via a trigger. The reconfig trigger (see PACT08) may be generated either via the reconfig port of an ALU PAE or implicitly by the CT. In optimally designed versions, this should fundamentally take place from the CT.

Just as instructions on a traditional processor are not interrupted, a CIW preferably also runs on the VPU without interruption until it requests the next CIW via a trigger to the CT. It is not terminated prematurely. To be able to nevertheless ensure a regular change of instructions (which will be needed later for multitasking), the maximum execution time of a CIW has an upper limit. The second property of an instruction is thus required. It is preferably the function of the compiler to ensure that each CIW generated meets this condition. A CIW that violates this condition is an invalid instruction. It may be recognized by the hardware during execution, e.g., via a watchdog timer, which generates a trigger more or less as a warning signal after a certain amount of time has elapsed.

The warning signal is preferably managed as a TRAP by the hardware and/or the operating system. The signal is also preferably sent to the CT. An invalid CIW is preferably terminated via a reconfig trigger, which causes a reset-like deletion of all configurations in the PA and/or an exception is also preferably sent to the operating system.

Since CIWs are very long, the instruction fetch time (the time between the reconfiguration trigger of the PAEs to the CT (see PACT08) and the loading of the configuration into the FILMO cache) and the instruction decode time (distribution of the configuration data from the FILMO cache (see PACT10) into the configuration registers of the PAEs) are also very long. Therefore, utilization of the execution units (i.e., the PA in the VPU processor model) by a process is not very high. How this problem may be solved with multiple processes is described in section 6 below.

3. Subprograms

A subprogram in the graphic representation is a partial graph of a program having uniquely defined input nodes. The edge of the subprogram call within the graph is thus statically known. The continuing edge at the output node of the subprogram, however, is not statically known. This is shown in FIG. 2. The edges of the main program (0201/0202) to the subprogram (0205) are present, but the continuation (0206) after the subprogram is not known to subprogram 0205. The particular continuation is fixedly connected to the subprogram call (indicated by dashed lines and dotted lines). It must be inserted in a suitable manner into the graphs before reaching the input node (0207, 0208). This is illustrated in FIG. 3.

In traditional processors, this is usually accomplished by storing the address of the instruction following the subprogram (this is precisely the missing edge) in a call stack when the subprogram is called (call, 0203, 0204). The address may be called from there by a return.

A stack PAE is thus needed when this principle is applied to the VPU. Like register PAEs, this is a process resource and is managed as such. The CIW, which causes the subprogram call when terminated, configures the return edge on the stack PAE. Through a trigger, the last CIW of the subprogram causes the stack PAE to remove the top entry from the stack and send it as a reconfiguration call to the CT.

In implementing a stack, one of the following methods may be used, for example:

-   An implementation within the CT. The stack is implemented in the software or as a dedicated hardware unit within the CT. A special config ID (e.g., −1) may be reserved as the return. When the CT receives this ID, it replaces it by the top entry of its locally managed stack.
-   A stack PAE, which may be designed as a modified RAM PAE according to PACT13 (FIG. 2), for example. Stack overflow and stack underflow are exceptions which are preferably forwarded to the operating system.
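The first method, a return stack managed inside the CT, can be sketched in software as follows. This is only an illustrative model: the stack depth, the error code and all identifiers are assumptions introduced here, and only the reserved config ID −1 is taken from the text.

```c
#include <assert.h>

#define CT_STACK_DEPTH   16    /* hypothetical stack depth */
#define CONFIG_ID_RETURN (-1)  /* reserved config ID meaning "return" */
#define CT_STACK_ERROR   (-2)  /* hypothetical overflow/underflow marker */

typedef struct {
    int entries[CT_STACK_DEPTH]; /* config IDs of pending continuations */
    int depth;                   /* number of valid entries */
} ct_stack;

/* On a subprogram call: remember the CIW that continues after the return. */
int ct_push_return(ct_stack *s, int continuation_id)
{
    if (s->depth == CT_STACK_DEPTH)
        return CT_STACK_ERROR;           /* stack overflow exception */
    s->entries[s->depth++] = continuation_id;
    return 0;
}

/* Resolve a requested config ID. The reserved ID is replaced by the
 * top entry of the locally managed stack; other IDs pass through. */
int ct_resolve(ct_stack *s, int requested_id)
{
    if (requested_id != CONFIG_ID_RETURN)
        return requested_id;
    if (s->depth == 0)
        return CT_STACK_ERROR;           /* stack underflow exception */
    return s->entries[--s->depth];
}
```

In this model the error code stands in for the exception that would preferably be forwarded to the operating system.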

4. The Register PAE

A traditional processor register contains a data word at each point in time. An instruction is able to read, write or modify the contents of the register (read-modify-write).

A VPU register will have the same properties, but instead of a single value, according to the present invention it will contain a value vector or parts thereof. It is possible and usually preferable for a VPU register to be organized as a type of FIFO. In certain cases, however, random access may also be necessary. The three types of register access mentioned above are explained in detail below. Random access is not discussed here.

Read access. At the start of a CIW, the register contains a data vector of unknown length. The individual elements of the vector are removed sequentially. With the last element of the vector, a trigger is generated, indicating that the register is now empty and the CIW may terminate. The status of the register may be characterized using three pointers which point to the first entry (0403) in the data vector, the last entry (0401) and the current entry (0402). The position of the pointer at the beginning of a CIW is shown as an example in FIG. 4 (a), where the pointer for the current entry points at the first entry.

FIG. 4 (b) shows in a first example how the pointer position of a register may appear at the end of a CIW. The vector has not been read completely in the case illustrated here.

Consequently, a decision must be made regarding what happens with the register contents. There are preferably the following options:

-   The register is emptied. All unprocessed data is deleted. The pointer for the current entry points at the last entry.
-   The register is reset to the original state. The next CIW may thus again access the full data vector. The pointer for the current entry is reset to point at the first entry.
-   Only the data already read is removed from the register. The unread data is then available for the next CIW. The pointers are not modified. Subsequently, the values between the first entry and the current entry are removed from the register. They are then no longer available for further operations.

The third option is of interest in particular when a CIW is unable to completely process the data vector because of the maximum execution time for a CIW. See also section 7.
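The three options for treating a partially read register can be modeled with the three pointers as follows. This is a sketch under stated assumptions: the pointers are stored as array indices, `last` is taken to be one past the final entry (so `current == last` means "empty"), and all identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* What happens to the register contents at the end of a CIW. */
typedef enum { REG_EMPTY, REG_RESET, REG_KEEP_UNREAD } reg_end_policy;

typedef struct {
    size_t first;   /* index of the first entry of the data vector */
    size_t current; /* index of the next entry to be read */
    size_t last;    /* assumption: index one past the last entry */
} reg_pointers;

/* Remove one element sequentially. Returns 0 if nothing is left;
 * *empty_trigger is set to 1 when the last element was just taken. */
int reg_read(reg_pointers *r, const int *data, int *out, int *empty_trigger)
{
    if (r->current == r->last)
        return 0;
    *out = data[r->current++];
    *empty_trigger = (r->current == r->last);
    return 1;
}

/* Apply one of the three options when the CIW terminates. */
void reg_end_of_ciw(reg_pointers *r, reg_end_policy policy)
{
    switch (policy) {
    case REG_EMPTY:       /* all unprocessed data is deleted */
        r->first = r->current = r->last;
        break;
    case REG_RESET:       /* next CIW sees the full vector again */
        r->current = r->first;
        break;
    case REG_KEEP_UNREAD: /* only the data already read is removed */
        r->first = r->current;
        break;
    }
}
```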

Write access. Data here is written sequentially into the register. A trigger is generated when the filling status of the register has reached a certain level. Depending on the CIW, this may be one of the following preferred possibilities:

-   The register is completely full.
-   There are still precisely n entries in the vector that are free. This takes into account the latency time in the CIW through which n values after the trigger are still running to the register.
-   The register is m % full.
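The three trigger conditions may be expressed as a single predicate on the filling status. The following is a minimal sketch with hypothetical names; treating "m % full" as "at least m percent of the capacity is occupied" is an assumption made here.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TRIG_FULL, TRIG_N_FREE, TRIG_PERCENT } trig_mode;

/* Returns 1 if the write trigger fires for the given filling status.
 * param is n for TRIG_N_FREE, m for TRIG_PERCENT, unused for TRIG_FULL. */
int write_trigger(size_t capacity, size_t used, trig_mode mode, size_t param)
{
    assert(used <= capacity);
    switch (mode) {
    case TRIG_FULL:    /* register is completely full */
        return used == capacity;
    case TRIG_N_FREE:  /* precisely n entries are still free */
        return capacity - used == param;
    case TRIG_PERCENT: /* register is at least m percent full */
        return used * 100 >= param * capacity;
    }
    return 0;
}
```

With the second mode, a CIW whose pipeline still holds n values in flight can stop producing data at the trigger and still have room for the values that arrive afterwards.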

A CIW which attempts to write into a completely full register is invalid and is terminated with an exception (illegal opcode). At the start of the CIW, the status of the register should be determined. FIG. 5 (a) shows a register before a write access which still contains data. Existing data may be deleted in such a way that the write access begins with an empty vector (FIG. 5 (b)). As an alternative, the write data may also be appended to the existing content. This is shown in FIG. 5 (c). This case is of interest when the preceding CIW was unable to generate the complete vector because of the maximum execution time.

FIGS. 5 (d) and (e) show the state of the register after successful write operations. The newly written data is indicated here with hatching.

Simultaneous read/write access. The restriction to pure read access or write access requires a greater number of registers than necessary. When data is removed from a register by read access, this yields locations which may be occupied by write data. It is only necessary to ensure that write data cannot be read again by the same CIW, i.e., there is a clear separation between the read data of a CIW and the write data of the CIW. For this purpose a virtual dividing line (0601) is introduced into the FIFO. The register has been read completely when this dividing line reaches the output of the FIFO. Suitable means may be implemented for defining this virtual dividing line.

If a write access for a data word is not executable because the register is still blocked by unread read data, the CIW is terminated and an illegal opcode exception is generated. The behavior of the register is otherwise exactly the same as that described for read and write access. In addition, one should specify what is to happen with the virtual dividing line between the read data and write data. This dividing line may remain at the location it is in at the moment. This is beneficial if a CIW must be terminated because of the time restriction. As an alternative, the dividing line may be set at the end of all data.

Combined read/write accesses are problematical, however, if the CIW has been terminated with an exception. In this case, it is no longer readily possible to reset the registers to their values at the start of the CIW. Debugging may then be hindered at the least (see also the following discussion in section 8).

FIG. 6 illustrates the functioning using an example, where the virtual dividing line is labeled 0601. At the beginning, the register contains data (a) which is subsequently read partially (b) or completely (c). Newly written and read entries are indicated here by different types of hatching. FIGS. 6 (d) and (e) show the state of the register after the required pointer update, which alters the position of the dividing line. This is not an explicit step, but is shown here only for the purpose of illustration. The entries that have been read must be removed immediately to make room for the new entries to be written.
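The behavior of the virtual dividing line can be modeled with a small ring buffer. This is only one of the "suitable means" that could define the line: here it is represented by a count of entries that are still readable by the current CIW. The capacity and all identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define RW_CAP 4  /* hypothetical register capacity */

typedef struct {
    int    data[RW_CAP];
    size_t head;     /* index of the oldest entry (output of the FIFO) */
    size_t count;    /* all stored entries: read data plus write data */
    size_t readable; /* entries in front of the virtual dividing line */
} rw_reg;

/* Read succeeds only while the dividing line has not reached the output. */
int rw_read(rw_reg *r, int *out)
{
    if (r->readable == 0)
        return 0;
    *out = r->data[r->head];
    r->head = (r->head + 1) % RW_CAP;
    r->count--;     /* the read entry is removed immediately ... */
    r->readable--;  /* ... making room for new write data */
    return 1;
}

/* Write fails if the register is still blocked by unread data; the
 * hardware would terminate the CIW with an illegal opcode exception. */
int rw_write(rw_reg *r, int value)
{
    if (r->count == RW_CAP)
        return 0;
    r->data[(r->head + r->count) % RW_CAP] = value;
    r->count++;     /* written behind the dividing line */
    return 1;
}

/* End-of-CIW alternative: set the dividing line at the end of all data. */
void rw_divider_to_end(rw_reg *r)
{
    r->readable = r->count;
}
```

Leaving `readable` untouched at the end of a CIW corresponds to the other alternative, in which the dividing line remains where it is.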

A process, i.e., a program which shares resources with other programs in a multitasking operation in particular, must allocate each required register before it may be used. This is preferably accomplished by an additional configuration register within the RAM PAE and/or the register PAE, an entry also being made indicating to which process the register now belongs. This configuration is retained over reconfigurations. The register must be explicitly released by the CT. This happens on termination of a process, for example. With the configuration of each CIW, the registers must be notified of which process the CIW belongs to. This makes it possible to switch between multiple register sets. This process is described in greater detail in section 6 below.

5. Interrupts

A distinction is made between two different types of interrupts. First there are hardware interrupts, where the processor must respond to an external event. These are usually processed by the operating system and are not visible for the ongoing processes. They are not discussed further here. The second type is the software interrupts which are frequently used to implement asynchronous interactions between the process and the operating system. For example, it is possible under VMS to send a read request to the operating system without waiting for the actual data. As soon as the data is present, the operating system interrupts the running program and calls a procedure of the program asynchronously. This method is known as an asynchronous system trap (AST).

This method may also be used in the same way on the VPU. To do so, support may be provided in the CT. The CT knows whether an asynchronous routine must be called up for a process. In this case, the next request coming from the array is not processed directly but instead is stored.

Instead, a sequence of CIWs is inserted, which first saves the processor status (the register contents), which executes the asynchronous routine and which then restores the register content. The original request may be subsequently processed.

6. Multitasking

As described above in section 2, the VPU architecture may, under some circumstances, not be optimally utilized with only one process because very long loading and decoding times occur, e.g., due to the length of the CIWs. This problem may be solved by simultaneous execution of multiple processes. According to the present invention, several register sets are provided on the VPU for this purpose, making it possible to simply switch between register sets when changing context without requiring any complex register clearance and loading operations. This also makes it possible to increase the processing speed.

During execution of CIWs of the processes, enough time is available to retrieve the instructions of the current process and distribute them via the FILMO to the PAEs and/or to load them from the configuration cache into the PAEs (see PACT31). The optimum number of register sets may be determined as a function of the average execution time of a CIW and the average loading and decoding times of the CIWs.

The latency time may be compensated by a larger number of register sets. It is important for the functioning of the method that the average CIW running time is greater than the amount of time effectively needed for loading and/or decoding the CIW in each case.

The corresponding registers of the different register sets are then at the same PAE address for the programmer. In other words, at any point in time, only the registers of one register set may be used. The change in context between the register sets may be implemented by transmitting the corresponding context to the PAEs before each CIW. The context switch may take place automatically as depicted in detail by the PUSH/POP operations according to PACT11 and/or by a special RAM/register PAE hardware, as depicted in PACT13 FIG. 21. Both cases involve a similar stack design in the memory. Each stack entry stores the data of a process. A stack entry includes the complete content of all registers, in other words, all memory cells of all memories which function as registers for a process. Likewise, according to PACT11, a stack entry may also contain PA-internal data and states.

In general, more processes will be present on a system than there are register sets in the processor. This means that a process must occasionally be removed from the processor. To do so, as in the case of the software interrupt, an edge of the program graph is divided by the CT. The register contents of the process are saved and the processor resources (registers, stack PAEs, etc.) allocated by the process are freed again. The resources thereby freed are then allocated by another process. The register contents stored for this process are then written back again and the process is continued on this divided edge. The register contents may then be saved and reloaded via CIWs.

7. CIW and Loops

On the basis of the property required above, namely that a CIW must terminate after a certain maximum number of cycles at the latest, general loops may not be translated directly into a CIW. It is always possible to translate the loop body into a CIW and to execute the loop control via reconfiguration. However, this often means a considerable sacrifice in terms of performance. This section shows how a loop may be reshaped to minimize the number of reconfigurations.

The following program fragment is assumed below:

while (condition) { something; }

The running time of “condition” and of “something” should be determinable, or it should at least be possible to estimate an upper bound for it. The loop may then be reformulated as follows:

while (1) { if (!condition) goto finish; something; } finish:

The body of the loop may now be iterated as often as allowed by the maximum running time of the CIW. A new variable z is introduced for this purpose; this variable does not occur either in “condition” or in “something.” The program now looks as follows:

while (1) { for (z=0; z<MAX; z++) { if (!condition) goto finish; something; } } finish:

The “for” loop has a maximum running time which may be determined by the compiler. It may therefore be mapped onto a CIW. MAX is determined by the compiler as a function of the maximum running time and the individual running times of the instructions.

The resulting CIW has two output edges. The output via goto leads to the next CIW; the output via the regular end of “for” forms an edge on itself. The endless loop is implemented via this edge.
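The two output edges can be made concrete with a small runnable model. This is a sketch, not a compiler output: the bound `MAX_ITER`, the example loop (repeated halving while the value exceeds 1) and all identifiers are assumptions chosen for illustration. The function `ciw_run` plays the role of the bounded “for” loop, and `run_loop` stands in for the CT, which reloads the same CIW as long as the self edge is taken.

```c
#include <assert.h>

#define MAX_ITER 4  /* hypothetical bound derived by the compiler */

typedef enum { EDGE_SELF, EDGE_NEXT } ciw_edge;

typedef struct {
    unsigned value; /* loop variable of the example */
    unsigned steps; /* number of executed loop bodies */
} loop_state;

/* One CIW: at most MAX_ITER iterations of
 *     while (value > 1) { value /= 2; steps++; }            */
ciw_edge ciw_run(loop_state *s)
{
    for (int z = 0; z < MAX_ITER; z++) {
        if (!(s->value > 1))
            return EDGE_NEXT;  /* "goto finish": edge to the next CIW */
        s->value /= 2;         /* "something" */
        s->steps++;
    }
    return EDGE_SELF;          /* regular end of "for": edge on itself */
}

/* Stand-in for the CT: counts how often the self edge is followed. */
unsigned run_loop(loop_state *s)
{
    unsigned self_edges = 0;
    while (ciw_run(s) == EDGE_SELF)
        self_edges++;
    return self_edges;
}
```

Starting from the value 1000, nine loop bodies are executed and the self edge is taken twice, compared with one reconfiguration per iteration if only the loop body were mapped onto a CIW.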

8. Debugging

In the traditional processor, debugging is performed on an instruction basis, i.e., the sequence of a program may be interrupted at any time between two instructions. At these interruption points, the programmer has access to the registers, may look at them and modify them. Interruption points may be implemented in various ways. First the program may be modified, i.e., the instruction before which the interruption is to occur, is replaced by other instructions which call the debugger. In the graphic model, this corresponds to replacing one node with another node or with a partial graph. Another method is based on additional hardware support, where the processor is notified of which instruction the program is to be interrupted at. The corresponding instruction is usually identified by its address.

Both possibilities may be applied to the VPU according to the present invention. One CIW may be replaced by another CIW, by action of the debugger, for example. This CIW may then, for example, copy the register contents into the main memory, where they may either be analyzed using a debugger external to the VPU or alternatively the debugger may also run on the VPU.

In addition, hardware support, which identifies CIWs on the basis of their ID when requested and then calls up the debugger, may also be provided in the CT. In addition, an interruption may also be fixedly attached to an edge of the graph because interruptions are present explicitly in contrast with traditional program code.

The type of debugging described above is completely adequate for traditional processors because the instructions are usually very simple. There is a sufficiently fine resolution of the observable points. In addition, the programmer may rely on the individual instructions being correct (usually ensured by the processor manufacturer).

On the VPU, however, it is possible for the programmer to define the CIWs which form a type of “processor instructions.” Accordingly, instructions defined in this way may themselves be defective. Debugging of the individual instructions is thus preferably designed in the manner referred to below as microcode debugging. Microcode debugging is designed in such a way that the programmer gains access to all internal registers and data paths of the processor. It has been recognized that the complexity necessary for this is readily justified by the increased functionality.

Hardware support for this is possible but very complex and is not appropriate for pure debugging purposes. Therefore, as an alternative, the status of the processor before the instruction in question is saved and the actual instruction is executed on a software simulator. This is the preferred method of debugging VPUs according to PACT11. The data and states are preferably transferred to the debugger via a bus interface, memory and/or preferably via a debugging interface such as JTAG. A debugger according to PACT21 is preferably used, preferably containing a mixed-mode debugger having an integrated simulator for processing the micro debugging.

In a suitable programming model, the debugger may also be called when an exception occurs within an instruction. It is appropriate here that the registers may be reset back to the state before the start of the instruction and that no other side effects have occurred. Then the instruction in question may be started in the software simulator and simulated up to the occurrence of the exception.

Particularly preferred debugging mechanisms are described in detail in PACT21.

Microcode debugging may preferably be implemented by configuring a debugging CIW before or after processing a CIW. This debugging CIW first receives all the states (e.g., in the PAEs) and then writes them into an external memory through a suitable configuration of the interconnection resources. The PUSH/POP methods described in PACT11 may be used here particularly preferably. This may preferably take place via an industry standard interface such as JTAG. Then a debugger may receive the data from the memory or via the JTAG interface and, if necessary, simulate it further incrementally in conjunction with a simulator (see PACT21), thus permitting microcode debugging.

9. Distributed Configuration Cache

On the basis of the central configuration cache in FILMO, it takes a relatively long time when using such a cache, which is not obligatory, until a configuration is distributed to the individual PAEs of a PAC. This section will now describe a preferred method for shortening this period of time. A similar alternative or additional method is also already described in PACT31, the full content of which is herewith incorporated for disclosure purposes.

For this purpose, each PAE has its own local cache which stores the configuration data of various configurations for precisely this PAE. The fact that a PAE has not received any data from a configuration is also stored. For each configuration requested, the cache may thus make one of the following statements:

-   The configuration data is present in the cache.
-   No data is needed for this configuration.
-   Nothing is known about this configuration.
-   Configuration data is needed but it is not available in the cache (e.g., due to the length of the configuration, RAM preload, etc.).

The last two statements may be combined here. With both statements, the code or the fact that no code is needed must be requested. An order for a configuration is sent by the FILMO as a broadcast on the test bus to all PAEs. If all PAEs have the configuration in their local cache, it may be started via broadcast on the config bus. In the ideal case, the start of the configuration thus requires the transmission of only a single configuration word.

If a PAE does not have the configuration data, this fact is reported back to the FILMO. In the simplest case, this is done via a reject on the existing line. The FILMO then knows on the basis of this signal that at least one PAE of the PAC does not have the configuration data. It may then transmit the complete data. As an alternative, each PAE may separately trigger a request for the data. In this case a suitable compromise must be made between the number of requests and the quantity of configuration data to be transmitted. Small PAC sizes are advantageous here because of the lower latency on the configuration bus.
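By way of illustration only, the order/reject exchange described above may be sketched in software as follows. The names CacheStatus, LocalCache, must_request and order_configuration are hypothetical and do not appear in the description; the sketch models only the simplest case, in which a single reject causes the FILMO to transmit the complete data.

```python
from enum import Enum


class CacheStatus(Enum):
    """The four statements a local PAE cache may make (names are assumptions)."""
    PRESENT = "configuration data is present in the cache"
    NO_DATA_NEEDED = "no data is needed for this configuration"
    UNKNOWN = "nothing is known about this configuration"
    NOT_CACHED = "data is needed but is not available in the cache"


class LocalCache:
    """Hypothetical per-PAE cache: maps a configuration number to a status."""

    def __init__(self, known=None):
        self.known = dict(known or {})

    def lookup(self, config_id):
        # Anything never seen before is reported as UNKNOWN.
        return self.known.get(config_id, CacheStatus.UNKNOWN)


def must_request(status):
    # The last two statements are combined: in both cases the PAE must ask.
    return status in (CacheStatus.UNKNOWN, CacheStatus.NOT_CACHED)


def order_configuration(pae_caches, config_id):
    """Model the broadcast of an order to all PAEs of a PAC.

    Returns "start" if no PAE rejects (a single start word suffices),
    otherwise "transmit" (the complete data must be sent).
    """
    rejected = any(must_request(c.lookup(config_id)) for c in pae_caches)
    return "transmit" if rejected else "start"
```

In this sketch a PAE that needs no data for a configuration does not reject, so a configuration touching only part of the PAC may still be started with a single word.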

Design of the Cache

A cache is generally composed of two parts. One part contains the actual data (here the configuration words, 0902) while the other part contains management information (here the configuration numbers contained as well as their age, 0901).

First the management part is described.

It is desirable for the configuration which has not been used for the longest period of time to be removed from the cache if this is necessary. As long as only new configurations are requested, the entries in the FIFO holding the management information are sorted correctly. If a configuration is requested for which there is already an entry in the FIFO, this entry must be removed from the FIFO. It is then reinserted at the end.

FIG. 7 shows an example of a FIFO stage modified for this purpose. The modules shown with hatching are in addition to a normal FIFO stage according to the related art. They compare via the comparator (0701) the configuration number of the data content of the stage with the requested configuration number and, if they are the same, generate an ack (0702) for that stage. Thus, the data of the stage is read via the multiplexer (0703) and all the other values move up by one stage.

The entries in this FIFO contain additional information in addition to the configuration number. This is either a pointer (address) to the configuration data or one of the two possibilities “no data necessary” (e.g., coded as 0) or “data must be requested” (e.g., −1).

FIG. 8 shows the connection of multiple stages, where the read chain is initialized with the required configuration number and the status −1. This value then comes out unchanged at the output of the read chain exactly when the configuration number is not stored in the FIFO. The output of the read chain may thus be used in any case to write the configuration number into the FIFO. Signal ack_in is activated when the FIFO is full and the desired configuration number is not in the FIFO. This is the only case when the oldest entry must be removed from the FIFO because the management memory is full.

The actual data memory is organized as a chained list because of the different number of configuration words per configuration. Other implementations are also conceivable. A chained list may then be implemented easily as a RAM by storing the address of the following data word in addition to the data.
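The behavior of the management FIFO described above may be illustrated by the following software sketch. It is not a model of the circuits of FIG. 7 and FIG. 8; the class ManagementFifo, its methods and the coding constants are assumptions chosen for illustration. A hit removes the matching entry and reinserts it at the end, a miss writes the read-chain output (configuration number with status −1) at the end, and the oldest entry is dropped only when the FIFO is full and the number is not found.

```python
NO_DATA = 0        # "no data necessary" (example coding from the text)
MUST_REQUEST = -1  # "data must be requested" (example coding from the text)


class ManagementFifo:
    """Hypothetical software model of the modified FIFO; oldest entry first."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = []  # each entry is (config_number, info)

    def request(self, config_number):
        """Look up a configuration; return (info, evicted_entry_or_None)."""
        # The read chain starts with the requested number and status -1 and
        # keeps that value unless some stage matches.
        chain_out = (config_number, MUST_REQUEST)
        hit = False
        for i, entry in enumerate(self.entries):
            if entry[0] == config_number:       # comparator match -> ack
                chain_out = self.entries.pop(i)  # later entries move up
                hit = True
                break
        evicted = None
        if not hit and len(self.entries) == self.depth:
            # Corresponds to the ack_in case: FIFO full and number not found.
            evicted = self.entries.pop(0)
        # The read-chain output is written at the end in every case.
        self.entries.append(chain_out)
        return chain_out[1], evicted

    def set_info(self, config_number, info):
        """Record a data pointer or NO_DATA once the outcome is known (assumed step)."""
        for i, (number, _) in enumerate(self.entries):
            if number == config_number:
                self.entries[i] = (number, info)
                return
        raise KeyError(config_number)
```

A caller would typically request a configuration, fetch its data on a −1 result, and then call set_info with the start address in the data memory.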

In addition to the lists for the actual configurations, a free list is carried, listing all the entries which are not being used. This must be initialized first after a reset.

FIG. 9 shows a possible cache content during operation. Free entries in the data memory are white, while entries occupied by a configuration are shown with hatching. Configurations need not be located at successive addresses. Configuration 18 has no configuration data and therefore also has no pointer into the data memory.

A new configuration is written into the free list in the data memory. In doing so the pointer information of the data memory is not modified. Only for the last data word of a configuration is the pointer information altered to indicate that the list is now being modified here. The pointer to the free list points at the next entry.

It may happen that the space in the free list is not sufficient to completely accommodate the incoming configuration data. In this case, a decision must be made as to whether an old configuration is to be removed from the data memory or whether the current configuration is not to be included in the cache. In the latter case, the subsequent configuration words are discarded. Since no pointer has been modified, the free list remains the same as before and only a few unused data words have a different value. The decision as to which configuration should no longer be in the cache (the oldest or the current) may be made on the basis of the number of configuration words already written. There is little point in removing several cached configurations to make room for a long RAM initialization, for example.

If the oldest configuration is to be removed, it is removed from the FIFO. The pointer for the last entry in the free list is set to the value taken from the FIFO. From this address on, configuration may be continued in the accustomed manner.

FIG. 10 shows an example of this. Configuration no. 7 is to be reconfigured. FIG. 10 (a) shows the free list as completely full. A decision is made to remove the oldest configuration (no. 5) from the cache and to write configuration no. 7 into the cache. To do so, the pointer is moved from the end of the free list to the start of former configuration 5. The free list is thus lengthened again and space is again available for new configuration words. The memory parts affected in this step are shown with contradiagonal hatching in FIG. 10 (b). With a suitable division of the memory, this may take place in one cycle. With the last configuration word, the corresponding pointer points at the end and the free pointer points at the next entry.

Space in the data memory is freed up not only when needed for the inclusion of a new configuration. If the management memory is full and an entry is therefore removed from it, the free list in the data memory must also be adapted. To do so, either the pointer at the end of the free list or the pointer at the end of the configuration being freed up is adapted. Neither piece of information is available at this point. It is possible to move through one of the lists until reaching the end. However, this is time-consuming. As an alternative, an additional pointer to the particular end of a configuration is stored in the management memory. Modification is then easily possible. The free pointer receives the starting address of the old configuration, and the pointer at the last configuration word in the data memory points at the free pointer.

This is illustrated in FIG. 11. The pointers to the configuration ends are shown with dashed lines. FIG. 11 (a) illustrates the situation before deletion, FIG. 11 (b) illustrates the situation afterwards.
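The chained-list data memory, the free list and the constant-time release using a stored end pointer may be sketched as follows. This is a minimal software model under stated assumptions: the class DataMemory, the END marker and the method names are hypothetical, and the choice of prepending a freed configuration to the free list follows the FIG. 11 variant described above.

```python
END = -1  # hypothetical end-of-list marker


class DataMemory:
    """Sketch of a RAM holding chained lists of configuration words."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("memory must have at least one entry")
        self.data = [None] * size
        # After reset the free list chains every entry: 0 -> 1 -> ... -> END.
        self.next = list(range(1, size)) + [END]
        self.free = 0  # free pointer: first unused entry

    def write(self, words):
        """Write words along the free list.

        Returns (start, end) addresses, or None if the free list is too
        short. On failure no pointer has been changed, so the free list is
        intact and only some unused data words hold different values.
        """
        if not words:
            raise ValueError("configurations without data need no entry")
        addr, last = self.free, END
        for word in words:
            if addr == END:
                return None  # discard the remaining configuration words
            self.data[addr] = word
            last, addr = addr, self.next[addr]
        start = self.free
        self.next[last] = END  # only the last word's pointer is altered
        self.free = addr       # free pointer moves to the next free entry
        return start, last

    def release(self, start, end):
        """Return a configuration to the free list without walking any list."""
        self.next[end] = self.free  # last word now points at the old free list
        self.free = start           # free pointer receives the start address

    def read(self, start):
        """Follow the chain from start and collect the configuration words."""
        words, addr = [], start
        while addr != END:
            words.append(self.data[addr])
            addr = self.next[addr]
        return words
```

The (start, end) pair returned by write corresponds to the two pointers kept in the management memory, which is what allows release to run without traversing either list.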

10. Optimization of Bus Allocation

The buses are currently defined explicitly by the router. This may result in two configurations overlapping on a bus and therefore not being able to run simultaneously although on the whole enough buses would be available.

It has been recognized that it does not matter in terms of the algorithm which bus carries a connection. Therefore, it is proposed that bus allocation be performed dynamically by the hardware and the hardware be provided with a suitable dynamic bus allocator. A configuration specifies only that it needs a connection from point A to point B within a row. An arbiter in the hardware which is able to work per row either via proximity relationships in a distributed manner or at a central location for the row then selects which of the available buses is in fact used. In addition, buses may be dynamically rearranged. Two short non-overlapping buses which have been configured to different bus numbers on the basis of a previous allocation may be rearranged to the same bus number when resources become available. This creates space for longer connections in the future.
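The proposed dynamic allocation may be illustrated by a simple software model of a central arbiter for one row. The sketch is an assumption throughout: the class RowBusArbiter, the first-fit selection policy, and the treatment of a connection as a half-open interval of positions (so that segments which merely touch do not conflict) are not taken from the description.

```python
class RowBusArbiter:
    """Hypothetical central arbiter for one row with a fixed number of buses."""

    def __init__(self, num_buses):
        # For each bus number, the list of segments (a, b) currently in use.
        self.buses = [[] for _ in range(num_buses)]

    @staticmethod
    def _overlaps(s, t):
        # Half-open intervals [a, b): touching segments do not overlap.
        return s[0] < t[1] and t[0] < s[1]

    def _fits(self, bus, seg):
        return not any(self._overlaps(seg, other) for other in self.buses[bus])

    def connect(self, a, b):
        """Request a connection from a to b; return the bus chosen or None."""
        seg = (min(a, b), max(a, b))
        for bus in range(len(self.buses)):  # first fit, lowest number first
            if self._fits(bus, seg):
                self.buses[bus].append(seg)
                return bus
        return None

    def release(self, bus, a, b):
        """Free a previously granted connection."""
        self.buses[bus].remove((min(a, b), max(a, b)))

    def rearrange(self):
        """Move segments to the lowest-numbered bus on which they fit."""
        for bus in range(1, len(self.buses)):
            for seg in list(self.buses[bus]):
                for lower in range(bus):
                    if self._fits(lower, seg):
                        self.buses[bus].remove(seg)
                        self.buses[lower].append(seg)
                        break
```

For example, with two buses, short segments (6, 8) on bus 0 and (2, 5) on bus 1 block a long connection (0, 8); after rearrange both short segments share bus 0 and the long connection can be granted on bus 1.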

1. A method of data processing using a processor comprising a reconfigurable field of data processing cells, the method comprising: configuring, by the processor, a first subset of the data processing cells, such that the first subset of the data processing cells has a first configuration while one or more other subsets of the data processing cells has, respectively, one or more other configurations; processing data, by the first subset of the data processing cells, while the first subset of the data processing cells is configured with the first configuration; monitoring, by the processor, whether a maximum allowed execution runtime of the first configuration is exceeded; and responsive to determining, in the monitoring step, that the maximum allowed execution runtime of the first configuration is exceeded, removing, by the processor, the first configuration and the one or more other configurations.
2. The method of claim 1, where the maximum allowed execution runtime of the first configuration is determined by the processor to be exceeded conditional upon a lapse of the maximum allowed execution runtime without the first subset of the data processing cells requesting a new configuration.
3. The method of claim 1, wherein the first subset of the data processing cells is adapted to, while the first subset of the data processing cells is configured with the first configuration, request a new configuration of one or more of the first subset of the data processing cells.
4. A method of data processing using a processor comprising a reconfigurable field of data processing cells and a memory arrangement, wherein the memory arrangement stores therein a data vector, the method comprising: sequentially reading, by the field using a first configuration of the field, a first subset of data elements of the data vector; monitoring whether a maximum allowed execution runtime of the first configuration is exceeded; responsive to determining in the monitoring step that the maximum allowed execution runtime is exceeded, removing the first configuration and configuring the field with a second configuration prior to readout of all of the data elements of the data vector, such that a second subset of the data elements of the data vector remains unread in the memory arrangement; and subsequent to the removing of the first configuration and the configuring of the field with the second configuration, sequentially reading, by the field using the second configuration, one or more data elements of the second subset of the data elements.
5. The method of claim 4, further comprising: for each of the sequentially read data elements, updating a pointer to point to a different memory location of the memory arrangement than prior to the updating, wherein a beginning of the sequential reading of the one or more data elements of the second subset of the data elements is performed based on a position into which the pointer entered while the field was configured with the first configuration.